How to do feature selection for text data?
Is PCA a FS method for text?
Other methods?
The additional features typically add noise. Machine learning will pick up on spurious correlations that may hold in the training set but not in the test set.
For some ML methods, more features means more parameters to learn (more NN weights, more decision tree nodes, etc…)
The increased space of possibilities is more difficult to search.
Why we need FS:
1. To improve performance (in terms of speed, predictive power, simplicity of the model).
2. To visualize the data for model selection.
3. To reduce dimensionality and remove noise.
Feature Selection is a process that chooses an optimal subset of features according to a certain criterion.
Feature selection is the process of selecting a specific subset of the terms of the training set and using only them in the classification algorithm.
Let \(p(c | t)\) be the conditional probability that a document belongs to class \(c\), given the fact that it contains the term \(t\). Therefore, we have:
\[\sum^k_{c=1}{p(c | t)} = 1\]
Then, the gini-index for the term \(t\), denoted by \(G(t)\) is defined as:
\[G(t) = \sum^k_{c=1}{p(c | t)^2}\]
The value of the gini-index lies in the range \([1/k, 1]\): it attains \(1/k\) when \(p(c|t)\) is uniform over the \(k\) classes, and \(1\) when the term occurs in only one class.
Higher values of the gini-index indicate a greater discriminative power of the term \(t\).
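The definition above is straightforward to compute. A minimal sketch, assuming we already have the estimated class distribution \(p(c|t)\) for documents containing the term (the function name is illustrative):

```python
import numpy as np

def gini_index(p_c_given_t):
    """G(t) = sum_c p(c|t)^2: the class distribution among
    documents that contain the term t."""
    p = np.asarray(p_c_given_t, dtype=float)
    assert np.isclose(p.sum(), 1.0), "p(c|t) must sum to 1"
    return float((p ** 2).sum())

# Uniform over k=4 classes: minimum value 1/k (no discriminative power)
print(gini_index([0.25, 0.25, 0.25, 0.25]))  # 0.25
# Term occurring in only one class: maximum value 1
print(gini_index([1.0, 0.0, 0.0, 0.0]))      # 1.0
```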
If the global class distribution is skewed, the gini-index may not accurately reflect the discriminative power of the underlying attributes.
➔ normalized gini-index
\({\chi}^2\) statistics with multiple categories
\[{\chi}^2_{avg}(t)=\sum_c{p(c)\,{\chi}^2(c,t)}\]
\[{\chi}^2_{max}(t) = \underset{c}{\max}\ {\chi}^2(c,t)\]
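A sketch of how the per-class scores combine across categories, assuming each \({\chi}^2(c,t)\) is computed from the standard 2×2 term/class contingency table (the counts and class priors below are made up for illustration):

```python
def chi2_term_class(A, B, C, D):
    """chi^2(c,t) from a 2x2 contingency table:
    A = docs in class c containing t,    B = docs in c without t,
    C = docs outside c containing t,     D = docs outside c without t."""
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))

# Hypothetical tables for one term across k=2 classes
tables = {"c1": (40, 10, 20, 30), "c2": (20, 30, 40, 10)}
p_c = {"c1": 0.5, "c2": 0.5}  # class priors p(c)

scores = {c: chi2_term_class(*t) for c, t in tables.items()}
chi2_avg = sum(p_c[c] * s for c, s in scores.items())  # expectation over classes
chi2_max = max(scores.values())                        # best single class
```

The averaged variant weights each class by its prior; the max variant credits a term that is highly indicative of even a single class.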
Many other metrics (Same trick as in \(\chi^2\) statistics for multi-class cases)
Mutual information
\[PMI(t;c) = \log\frac{p(t,c)}{p(t)p(c)}\]
Odds ratio
\[Odds(t;c) = \frac{p(t | c)}{1 - p(t | c)} \times \frac{1 - p(t | \bar{c})}{p(t | \bar{c})}\]
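Both metrics are one-liners once the probabilities are estimated from the training set. A minimal sketch with illustrative probability values:

```python
import math

def pmi(p_tc, p_t, p_c):
    """Pointwise mutual information between term t and class c."""
    return math.log(p_tc / (p_t * p_c))

def odds_ratio(p_t_given_c, p_t_given_not_c):
    """Odds of seeing t in class c versus outside class c."""
    return (p_t_given_c / (1 - p_t_given_c)) * \
           ((1 - p_t_given_not_c) / p_t_given_not_c)

print(pmi(0.1, 0.2, 0.5))    # 0.0 -> t and c are independent
print(odds_ratio(0.8, 0.2))  # ~16 -> t strongly indicates c
```

PMI is zero under independence, positive when the term co-occurs with the class more than chance; the odds ratio is 1 under independence and grows with the term's preference for \(c\).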
\[\min_\alpha{\hat{R}(\alpha, \sigma)} = \min_\alpha{\left\{\sum_{k=1}^m{L(f(\alpha, \sigma \circ x_k), y_k)} + \Omega(\alpha)\right\}}\]
Replace the regularizer \(||w||^2\) by the \(l_0\) norm \(\sum_{i=1}^n{1_{w_i \neq 0}}\)
Further replace \(\sum_{i=1}^n{1_{w_i \neq 0}}\) by \(\sum_i{log{(\epsilon + |w_i|)}}\)
Boils down to the following multiplicative update algorithm:
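The update itself is not reproduced here, so the following is only a sketch of the standard iteratively reweighted \(l_1\) scheme for the \(\sum_i \log(\epsilon + |w_i|)\) penalty, assuming a squared loss and scikit-learn's `Lasso` as the inner solver; the data and all names are hypothetical. Each iteration multiplies the columns of \(X\) by \(\epsilon + |w_i|\), which is equivalent to penalizing \(|w_i|\) with weight \(1/(\epsilon + |w_i|)\):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, eps = 200, 20, 1e-3
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = 5.0                       # only the first 3 features are relevant
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Iteratively reweighted l1: solve a weighted lasso whose per-feature
# penalty is 1 / (eps + |w_prev|), implemented by multiplicatively
# rescaling the columns of X and mapping the solution back.
w = np.ones(d)
for _ in range(3):
    scale = eps + np.abs(w)            # large |w_i| -> weaker penalty on i
    model = Lasso(alpha=0.1).fit(X * scale, y)
    w = model.coef_ * scale            # back to original coordinates

selected = np.flatnonzero(np.abs(w) > 1e-6)
print(selected)                        # indices of surviving features
```

Irrelevant features are driven to exactly zero, while the penalty on large weights flattens out, mimicking the \(l_0\) behaviour better than a plain lasso.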
PCA is one of the most common feature reduction techniques
A linear method for dimensionality reduction
Allows us to combine much of the information contained in \(n\) features into \(p\) features where \(p < n\)
PCA is unsupervised in that it does not consider the output class/value of an instance – There are other algorithms which do (e.g. Linear Discriminant Analysis)
PCA works well in many cases where data have mostly linear correlations
\[var(X) = \frac{\sum_{i = 1}^n{(X_i - \bar X)(X_i - \bar X)}}{(n - 1)}\] \[cov(X,Y) = \frac{\sum_{i = 1}^n{(X_i - \bar X)(Y_i - \bar Y)}}{(n - 1)}\]
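The covariance formulas above are what PCA diagonalizes: the principal components are the eigenvectors of the covariance matrix with the largest eigenvalues. A minimal numpy sketch on random data (shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))      # 100 instances, n = 5 features

# Covariance with the (n - 1) denominator, matching the formulas above
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (X.shape[0] - 1)
assert np.allclose(C, np.cov(X, rowvar=False))

# Keep the p eigenvectors with the largest eigenvalues
p = 2
eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
components = eigvecs[:, ::-1][:, :p]   # top-p principal directions
Z = Xc @ components                    # reduced p-dimensional representation
print(Z.shape)                         # (100, 2)
```

In practice one would use a library PCA implementation, but the eigendecomposition view makes the "combine \(n\) features into \(p < n\)" step explicit.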